(54) Selection of acoustic models using speaker verification



(57) Usually in a speaker adaptive system, every
time a change in speaker occurs, the new speaker has to choose
which of the available model sets to use, e.g. the SI
model set if it is the first time he/she is using the system,
or a model set already adapted to him/her if he/she has used
it before. If adapted model sets are not stored at all, the
adaptation process starts over and over again using the
SI models if the same speaker uses the system repeatedly.
According to the invention a change in speaker is
detected automatically. Furthermore, the system identifies
the speaker and checks whether he/she has used the system
before and whether a model set adapted to him/her is already
available. If this is the case, this model set will be taken
for further recognition and adaptation.
Description

[0001] This invention is related to a method and a de- 
vice to perform automatic speech recognition, in partic- 
ular to a method and a device to increase the recognition 
rate in speech recognition systems that are used by dif- 
ferent users. 

[0002] State of the art speech recognizers consist of 
a set of statistical distributions modeling the acoustic
properties of certain speech segments. These acoustic 
properties are encoded in feature vectors. As an exam- 
ple, one Gaussian distribution can be taken for each 
phoneme. These distributions are attached to states. A 
(stochastic) state transition network (usually hidden 
Markov models) defines the probabilities for sequences 
of states and sequences of feature vectors. Passing a 
state consumes one feature vector covering a frame of 
e.g. 10 ms of the speech signal. 
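By way of illustration only, the structure described in paragraph [0002] can be sketched as follows: each state carries one diagonal-covariance Gaussian, a transition table gives the probabilities of state sequences, and each state visit consumes exactly one feature vector. The sketch uses the Viterbi algorithm to find the best state sequence. The function names, the two-state left-to-right topology used in the usage note and all numeric values are assumptions made for this example and are not taken from the invention.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi(frames, means, variances, log_trans, log_init):
    """Return (best log score, best state path) for a sequence of frames.

    means[s] and variances[s] describe the Gaussian attached to state s;
    log_trans[p][s] is the log probability of moving from state p to s.
    """
    n = len(means)
    score = [
        log_init[s] + log_gauss(frames[0], means[s], variances[s])
        for s in range(n)
    ]
    back = []
    for x in frames[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n):
            # Best predecessor of state s for this frame.
            best = max(range(n), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(
                prev[best] + log_trans[best][s]
                + log_gauss(x, means[s], variances[s])
            )
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best_score, path
```

For example, with two states whose means are 0.0 and 5.0 and a left-to-right transition table, the frames 0.1, 0.2, 4.9, 5.1 are aligned so that the first two frames fall in the first state and the last two in the second.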
[0003] The stochastic parameters of such a recogniz- 
er are trained using a large amount of speech data either 
from a single speaker yielding a speaker dependent 
(SD) system or from many speakers yielding a speaker 
independent (SI) system. 

[0004] Speaker adaptation (SA) is a widely used
method to increase recognition rates of SI systems.
State of the art speaker dependent systems yield much
higher recognition rates than speaker independent systems.
However, for many applications, it is not feasible
to gather enough data from a single speaker to train the
system. In the case of a consumer device this might not even
be wanted. To overcome this mismatch in recognition
rates, speaker adaptation algorithms are widely
used in order to achieve recognition rates that come
close to those of speaker dependent systems, while using only a
fraction of the speaker dependent data that such systems
require. These systems initially take speaker
independent models that are then adapted so as to better
match the speaker's acoustics.
[0005] Usually, the adaptation is performed in supervised
mode. That is, the spoken words are known and
the recognizer is forced to recognize them. This yields a
time alignment of the segment-specific distributions.
The mismatch between the actual feature
vectors and the parameters of the corresponding distribution
forms the basis for the adaptation. Supervised
adaptation requires an adaptation session to be
done with every new speaker before he/she can actually
use the recognizer.

[0006] Usually, the speaker adaptation techniques
modify the parameters of the hidden Markov models so
that they better match the acoustic characteristics of
new speakers. Normally, in batch or off-line adaptation
a speaker has to read a pre-defined text before he/she
can use the system for recognition, and this text is then
processed to do the adaptation. Once this is finished the
system can be used for recognition. This mode is also called
supervised adaptation, since the text is known to the
system and a forced alignment of the corresponding
speech signal to the models corresponding to the text
is performed and used for adaptation.
[0007] However, an unsupervised or on-line method
is better suited for most kinds of consumer devices. In
this case, adaptation takes place while the system is in
use. The recognized utterance is used for adaptation
and the modified models are used for recognizing the
next utterance and so on. In this case the spoken text
is not known to the system, but the word(s) that were
recognized are taken instead.

[0008] An adaptation of the speaker adapted model
set can be repeatedly performed to further improve the
performance for specific speakers. There are several
existing methods for speaker adaptation, e.g. maximum a
posteriori (MAP) adaptation or maximum likelihood linear
regression (MLLR) adaptation.
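As a hedged illustration of the MAP approach named above, the following sketch updates the mean of a single Gaussian by interpolating between the prior (e.g. speaker independent) mean and the frames observed for the new speaker. It assumes that every frame has already been assigned to this Gaussian with full weight, which simplifies the usual occupation-probability weighting. The function name `map_adapt_mean` and the prior weight `tau` are illustrative choices and are not specified in this document.

```python
def map_adapt_mean(prior_mean, frames, tau):
    """MAP update of a Gaussian mean.

    prior_mean: mean vector of the prior (e.g. SI) model.
    frames:     feature vectors assigned to this Gaussian (assumed hard assignment).
    tau:        prior weight; a larger tau keeps the result closer to prior_mean.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: the prior mean is returned unchanged.
        return list(prior_mean)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]
```

With a prior mean of [0.0, 0.0], two frames [2.0, 4.0] and [4.0, 8.0] and `tau = 2`, the adapted mean lies halfway between the prior mean and the sample mean of the frames. With little data the prior dominates; as more frames arrive the estimate moves toward the speaker's own statistics.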
[0009] However, these speaker adaptive speech recognition
systems, especially systems working with unsupervised
adaptation, are always adapted to one
speaker only. Therefore, if the speaker changes, adaptation
has to be restarted (using the SI models) for this
new speaker before he/she can use the system with an
improved recognition rate.

[0010] Speaker adaptation techniques are widely 
used in many kinds of speech recognition systems, e.g. 
dictation systems. In some of these systems it is possi- 
ble to store the speaker adapted models, so that differ- 
ent speakers can use the system with different speaker 
adapted models. But each time it has to be specified 
manually which of the adapted models to use. 
[0011] On the other hand, it is known that speaker ver- 
ification and identification techniques are used for ac- 
cess control of e.g. buildings or systems. 
[0012] Therefore, it is the object underlying the 
present invention to propose a method and a device for 
speaker adaptation that overcomes the problems de- 
scribed above. 

[0013] The inventive method is defined in independ- 
ent claim 1 and the inventive device is defined in inde- 
pendent claim 5. Preferred embodiments thereof are re- 
spectively defined in the respective following dependent 
claims. 

[0014] As mentioned above, according to the prior art
adaptation has to be restarted using the speaker independent
(SI) models again if there is a change in speaker.
[0015] When talking about a home or car environment 
there will be a change in speaker quite often, but it will 
be a more or less fixed set of speakers, e.g. the mem- 
bers of a family. So it is not very reasonable to start ad- 
aptation all over again every time one of the speakers 
starts using the system and discard all previous adap- 
tation to specific speakers. 

[0016] According to the present invention, on the oth- 
er hand, the system recognizes the speaker, and if ad- 
aptation has already been conducted for that speaker, 
the models already existing will be used for further ad- 
aptation. Speaker verification techniques are used for 
recognizing who is speaking.
[0017] According to the present invention this change
in speaker is detected automatically. Therefore, in a net- 
worked system that is mainly used by the same persons, 
but with a frequent change between them, the speech 
recognition system according to the present invention 
does not restart the adaptation to a different speaker 
every time the speaker changes, but it first checks the 
identity of the speaker, so that the system can switch to 
an adapted model set for this particular speaker, if it ex- 
ists. In this case, said model set is stored and used for 
recognition and further adaptation. Together with the 
speaker adapted model set, the statistical hyper parameters
necessary for the adaptation are stored so that the
adaptation can continue and does not have to be restart- 
ed when the same speaker uses the system again. Such 
hyper parameters could e.g. be weights that determine 
the adaptation speed to adapt a certain speaker adapt- 
ed model set to the corresponding speaker. If no model 
set exists for this particular speaker, a new one will be 
built using adaptation starting with the SI models. 
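The reason for storing the hyper parameters together with each model set can be sketched as follows. In this toy example a model set is reduced to a single mean vector, and the only hyper parameter is a weight that grows with the amount of data seen and thereby slows the adaptation down. When a speaker returns, adaptation continues from the stored mean and weight instead of restarting from the SI model. The class names `AdaptedModel` and `ModelStore`, the function `adapt` and the particular update rule are assumptions made for illustration, not the method of the invention.

```python
import copy
from dataclasses import dataclass


@dataclass
class AdaptedModel:
    mean: list      # toy stand-in for a full HMM model set
    weight: float   # hyper parameter: how strongly the current mean is trusted


def adapt(model, frames):
    """Move the mean toward the new frames; the stored weight sets the speed."""
    n = len(frames)
    if n == 0:
        return
    for d in range(len(model.mean)):
        total = sum(f[d] for f in frames)
        model.mean[d] = (model.weight * model.mean[d] + total) / (model.weight + n)
    model.weight += n


class ModelStore:
    """Keeps the SI model and one adapted model (with weight) per speaker."""

    def __init__(self, si_model):
        self._si = si_model
        self._adapted = {}

    def get(self, speaker_id):
        # Unknown speaker: start a new model set from a copy of the SI model.
        if speaker_id not in self._adapted:
            self._adapted[speaker_id] = copy.deepcopy(self._si)
        return self._adapted[speaker_id]
```

In use, `adapt(store.get("anna"), frames)` can be called whenever the speaker identified as "anna" (a hypothetical identifier) is talking; calls for other speakers in between leave her stored mean and weight untouched.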
[0018] The method and device according to the 
present invention will be better understood from the fol- 
lowing detailed description of an exemplary embodi- 
ment thereof taken in conjunction with the appended 
drawings, wherein: 

Figure 1 shows a speech recognition system ac- 
cording to the present invention using 
speaker adaptation and automatic identifi- 
cation of the speaker; and 

Figure 2 shows the verification and adaptation pro- 
cedure performed according to the present 
invention. 

[0019] Figure 1 only shows the part of the automatic 
speech recognition system according to the present in- 
vention that is used for speaker adaptation and auto- 
matic identification of the speaker. 
[0020] The analogue speech signal generated by a
microphone 1 is converted into a digital signal in an A/D
conversion stage 2 before a feature extraction is performed
by a feature extraction module 3 to obtain a feature
vector, e.g. every 10 ms. This feature vector is fed
into a verification module 4 and a recognition module 5.
In the verification module 4 an automatic identification
of the speaker is performed, as described above. In the
recognition module 5 recognition of the spoken utterance
is performed on the basis of the extracted feature vectors
and a set of HMM models. The recognition module 5
also feeds the recognition result to an adaptation module 6
that can adapt a certain HMM model set to a certain
speaker.
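The step from the digital signal to one feature vector per 10 ms can be sketched as below. The sampling rate of 16 kHz, the use of non-overlapping frames and the single log-energy feature are assumptions chosen to keep the example short; the document does not specify which features the feature extraction module 3 computes. The names `split_into_frames` and `log_energy` are likewise illustrative.

```python
import math


def split_into_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Cut a list of samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def log_energy(frame):
    """Toy one-dimensional feature: log of the frame energy."""
    # The small constant avoids log(0) for silent frames.
    return math.log(sum(s * s for s in frame) + 1e-10)


def extract_features(samples, sample_rate_hz=16000, frame_ms=10):
    """Return one (toy) feature vector per frame."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [[log_energy(f)] for f in frames]
```

At the assumed 16 kHz, each 10 ms frame holds 160 samples, so one second of speech yields 100 feature vectors that are passed to both the verification and the recognition module.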

[0021] The HMM model set to be accessed or adapted
by the recognition module 5 or the adaptation module 6
is selected by the verification module 4 from a speaker
independent model set or one of several speaker
adapted model sets that are respectively adapted to different
individual speakers. These different model sets
are stored in storages 7, 8, 9 and 10 and selected via a
switch 11 that has its fixed terminal connected to the recognition
module 5 and the adaptation module 6 and its
movable terminal switched, dependent on a control signal that is
received from the verification module 4, to one of the
model sets described before.

[0022] It is also possible that the speaker adapted
model sets are not adapted to individual speakers, but
to individual groups of speakers, such as Germans, British
people, Germans speaking English, American people
and so on, or people speaking different dialects.
These groups can also be identified automatically according
to well known language or dialect identification
algorithms working directly on the speech signal.
[0023] Of course, instead of the switch 11 a different 
solution having the same function can be selected. 
[0024] Figure 2 shows the verification and adaptation 
procedure performed in the recognition system accord- 
ing to the present invention. In a first step S1 a spoken 
utterance of a user is received, A/D converted and fur- 
ther processed to extract the feature vectors. Thereaf- 
ter, it is checked in a step S2 whether a new speaker is 
talking or not. If a new speaker is talking, it is checked 
in step S3 whether an adapted model set already exists 
for this speaker or not. If an adapted model set already
exists, this model set is used for further adaptation in a
step S4, whereafter the next spoken utterance is processed
in step S1 and the whole procedure is repeated
with this next utterance.

[0025] If no adapted model set exists in step S3, ad- 
aptation with the speaker independent model is started 
in step S6 and a new model set (speaker adapted) is 
added to the system, whereafter the next utterance is
processed in step S1 and the whole procedure will be repeated
with this next utterance. If it is determined in step
S2 that no new speaker is talking, the adaptation will be 
done with the current model set in step S5, whereafter 
the next spoken utterance is processed in step S1 and 
the whole procedure will be repeated with this next ut- 
terance. 
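The decision flow of Figure 2 can be sketched as a single function that is called once per utterance. The sketch assumes that step S1 has already produced the feature vectors and that a speaker identifier is available from the verification module; how that identifier is obtained is outside the scope of the example. The dictionary layout of `state`, the function name `process_utterance` and the `adapt` callback are hypothetical, and the returned string simply names the adaptation step that was taken.

```python
import copy


def process_utterance(state, speaker_id, features, adapt):
    """Run steps S2 to S6 for one utterance and return the step taken.

    state holds the SI model under "si", the adapted model sets under
    "adapted" (keyed by speaker) and the current speaker under "current".
    """
    if speaker_id != state["current"]:              # S2: is a new speaker talking?
        state["current"] = speaker_id
        if speaker_id in state["adapted"]:          # S3: adapted model set exists?
            step = "S4"                             # continue with the stored set
        else:
            # S6: start a new speaker adapted set from the SI models.
            state["adapted"][speaker_id] = copy.deepcopy(state["si"])
            step = "S6"
    else:
        step = "S5"                                 # same speaker: current set
    adapt(state["adapted"][speaker_id], features)
    return step
```

Starting with `state["current"]` set to `None`, the first utterance of any speaker is treated as a speaker change, so the first model set is always created through step S6.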



Claims 

1. Method to perform an automatic speech recognition,
characterized in that

a change of the speaker is detected automatically;

a speaker gets identified; and

an individual model set adapted to the identified
speaker is used for the speech recognition procedure,
if it is available, otherwise such an individual
speaker adapted model is newly generated
for said speaker.

2. Method according to claim 1, characterized in that
an individual speaker adapted model set is generated
on basis of a speaker independent model.

3. Method according to claim 1 or 2, characterized in
that an individual speaker adapted model set is adapted
on basis of utterances of the corresponding
speaker.

4. Method according to claim 1, 2 or 3, characterized
in that an individual speaker adapted model set is
adapted on basis of hyper parameters of the corresponding
speaker.

5. Method according to any one of claims 1 to 4, characterized
in that the speech recognition is performed
on basis of Hidden Markov Models.

6. Recognition system, comprising 

a microphone (1) to receive spoken words of a
user and to output an analog signal;

an A/D conversion stage (2) connected to said
microphone (1) to convert said analog signal into
a digital signal;

a feature extraction module (3) connected to
said A/D conversion stage (2) to extract feature
vectors of said received words of the user from
said digital signal;

a recognition module (5) connected to said feature
extraction module (3) to recognize said received
words of the user on basis of said feature
vectors;

an adaptation module (6) receiving the recognition
result from said recognition module (5) to
generate and/or adapt a speaker adapted
model set;

characterized by

a verification module (4) identifying a new
speaker and selecting an individual speaker adapted
model set that forms the basis for speech recognition
of said identified speaker and model adaptation
to said identified speaker.

7. Recognition system according to claim 6, characterized
by a storage (7, 8, 9, 10) for a speaker independent
model set and each individual speaker
adapted model set including the adaptation hyper
parameters.

8. Recognition system according to claim 7, characterized
in that respective adaptation hyper parameters
are stored in a storage (8, 9, 10) of a corresponding
individual speaker adapted model set.
